1 Introduction

Project EDIFES now has 779 cleaned building datasets ingested into HBase and roughly 20 working building markers. This report documents preliminary work towards a complete cross-sectional study of all buildings and markers.

1.1 Data Characteristics

  • Total of 779 buildings
  • 209 buildings have square feet readings
    • 92 of these are ‘accurate’ and 117 are estimated
  • 424 buildings have a standardized building type
  • 155 buildings have number of floors information

1.2 Markers

Markers constantly change and results presented here reflect markers as of 12/1/2017.

  • Markers run on all buildings are as follows:
    • Heating Type
    • EUI (where applicable)
    • Effective Thermal Resistance (where applicable)
    • Base-to-Peak Ratio
    • Summary Statistics
    • Data Quality Check
    • HVAC Size
    • HVAC Schedule
  • Additional analysis covers
    • Climate zone (kgcz and ASHRAE) distribution
    • Annual Consumption
    • Correlations between weather conditions and electricity consumption

2 Building Types

Results were obtained via the StandardizeBuildingTypes function written by Shreyas Kamath.

Standardized Building Types
type count
Banking 1
Educational 74
Entertainment 10
Food Sales & Service 34
Healthcare 33
Industrial 32
Office 41
Other 25
Public services 15
Retail 106
Services 25
Skyscraper 15
Storage 7
Utilities 6

Completely pointless wordcloud of building types.

3 Climate Zone Distribution

Results were obtained via functions I wrote for kgcz (mrk-climate_identifier.R) and ASHRAE (mrk-get_ashrae_cz.R) climate zone identification from latitude and longitude. The KGCZ is based on latitude and longitude with 0.5 degree precision, and the ASHRAE climate zone is based on querying the United States Census Bureau API to retrieve the county and matching that to a list of counties and climate zones.
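As a rough illustration of the 0.5 degree lookup described above, here is a minimal sketch. It is not the code in mrk-climate_identifier.R. The function names, the tiny lookup table, and the assumption that grid cells are keyed by their centers (x.25 and x.75) are all hypothetical.

```python
import math

def grid_cell_center(lat, lon, step=0.5):
    """Snap a coordinate to the center of its step-degree grid cell.

    Assumption: cells are keyed by their center point (e.g. 41.25, 41.75).
    """
    def snap(value):
        return math.floor(value / step) * step + step / 2
    return snap(lat), snap(lon)

# Hypothetical lookup table keyed by cell center. The two entries are
# illustrative only; a real table would cover the whole globe.
KGCZ_GRID = {
    (41.25, -81.75): "Dfa",
    (33.25, -112.25): "BWh",
}

def lookup_kgcz(lat, lon, grid=KGCZ_GRID):
    """Return the climate zone for a coordinate, or None if the cell is missing."""
    return grid.get(grid_cell_center(lat, lon))
```

Any two buildings that fall in the same 0.5 degree cell receive the same zone, which is the practical meaning of "0.5 degree precision".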

3.1 Koppen Geiger Climate Zones

Image to orient ourselves.

3.1.1 Table of KGCZ climate zone counts

KGCZ Counts
kgcz count
BSh 10
BSk 14
BWh 1
BWk 1
Cfa 424
Cfb 14
Csa 17
Csb 54
Dfa 163
Dfb 78

3.1.2 Plot of KGCZ Distribution

  • 75% of buildings occur in just 2 climate zones, Cfa and Dfa.
  • 4 climate zones have over 30 buildings (30 being the common rule-of-thumb sample size for invoking the central limit theorem)
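The two figures above can be checked directly from the KGCZ count table. This is a small sketch of the arithmetic; the variable names are mine.

```python
# Counts copied from the KGCZ table above.
kgcz_counts = {
    "BSh": 10, "BSk": 14, "BWh": 1, "BWk": 1, "Cfa": 424,
    "Cfb": 14, "Csa": 17, "Csb": 54, "Dfa": 163, "Dfb": 78,
}

total = sum(kgcz_counts.values())

# Share of buildings in the two most populated zones.
top_two = sorted(kgcz_counts.values(), reverse=True)[:2]
top_two_share = sum(top_two) / total

# Zones that clear the 30-building rule of thumb.
zones_over_30 = sorted(zone for zone, n in kgcz_counts.items() if n > 30)
```

The table sums to 776 buildings, so the share is computed over buildings that received a KGCZ rather than over all 779.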

3.2 ASHRAE Climate Zones

Image to orient ourselves

3.2.1 Table of ASHRAE Climate Zones

ASHRAE CZ Counts
a_cz count
2A 5
2B 2
3A 2
3B 32
3C 58
4A 425
4C 6
5A 242
5B 1

3.2.2 Plot of ASHRAE Climate Zone Distributions

  • 85.6% of buildings are in two climate zones, 4A and 5A
  • 4 climate zones with more than 30 buildings

4 Data Quality

The data quality check currently does not work for 35 datasets (all those with 1-minute interval data).

  • A four-letter grade (for example, AAAP) represents data quality before and after cleaning
  • These statistics were not computed on the actual raw data for these datasets

4.1 Data Quality standards

4.2 Plot of All Quality before and after

4.3 Data Quality Changes through Cleaning

This shows all changes with more than 5 occurrences.

4.4 Data Quality by Sample Set

Now, to highlight the datasets that were not AAAP.

Non AAAP Quality before Cleaning
Sampleset AAAF AABF AABP AACP AADP BAAF BAAP BADP CAAP CADP DAAP
CWRU 0 0 0 1 6 0 0 0 0 0 1
FirstE 0 0 1 0 2 0 0 0 0 0 2
FirstE_ex 0 0 0 0 0 0 11 0 18 0 0
JCI2 1 1 5 0 2 0 1 0 1 0 2
JCI2_ex 1 0 2 0 1 0 30 2 28 1 6
KSU 0 0 0 0 0 0 0 0 0 0 1
Prog 2 0 1 0 0 1 0 0 0 0 0
Schools 0 0 0 0 0 0 1 0 0 0 0
Starbucks 0 0 0 0 0 0 0 0 0 0 0
Non AAAP Quality after Cleaning
Sampleset AAAF BAAF BAAP CAAF CAAP DAAF
CWRU 1 0 0 0 0 1
FirstE 2 0 0 0 0 1
FirstE_ex 7 4 2 6 2 2
JCI2 4 0 0 0 0 0
JCI2_ex 18 5 4 5 2 7
KSU 1 0 0 0 0 0
Prog 2 1 0 0 0 0
Schools 0 1 0 0 0 0
Starbucks 0 0 0 0 0 0

4.5 Plots of data quality by sample set

5 Heating Type

5.1 Heating Type by Location

In the following map, the Starbucks locations are the stars. These demonstrate considerably different behavior than the other buildings which might explain why they are classified as non-electrical heating despite being located in the Southwest.

5.2 Heating Type Determination

The current process to determine the heating type is as follows

  • Data subset to business days between 7 am and 7 pm
  • Linear model between energy use and temperature
  • Slope is extracted from the model
  • Cut point is set at a slope of 0
  • Negative slope indicates electrical heating
  • Positive slope indicates non-electrical (gas or no?) heating
  • Winter and summer temperatures determined from changepoint analysis
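The steps above can be sketched as follows. This is a simplified illustration, not the marker's actual code: it treats Monday to Friday as business days, skips the changepoint step, and assigns a slope of exactly 0 to the non-electrical class. All function names are hypothetical.

```python
from datetime import datetime

def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def classify_heating(records, cut_point=0.0):
    """Classify heating type from (timestamp, temperature, energy) records.

    Assumptions: business days are Monday to Friday (holidays ignored) and
    the 7 am to 7 pm window includes 7:00 and excludes 19:00.
    """
    subset = [
        (temp, energy)
        for ts, temp, energy in records
        if ts.weekday() < 5 and 7 <= ts.hour < 19
    ]
    temps = [temp for temp, _ in subset]
    energy = [e for _, e in subset]
    slope = ols_slope(temps, energy)
    # Negative slope: consumption rises as it gets colder, so heating is electrical.
    label = "electrical" if slope < cut_point else "non-electrical"
    return slope, label
```

Exposing cut_point as a parameter makes it easy to experiment with thresholds other than 0.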

Example plot of heating type. Conclusion from this plot is electrical heating.

Everything to the right of the black vertical line in the following plot is classified as non-electrical heating while everything to the left is classified as electrical.

We can segment the plot to between -0.1 and 0.1 because the majority of slopes fall in that range.

The question is where to draw the line for electrical heating. Currently the cut point is at a slope of 0, but this might need to be adjusted or we need to use a different method.

Thoughts?

6 Annual Consumption

6.1 Annual Consumption in kWh

This is annual consumption for the most recent year.

6.1.1 Summary Statistics

Annual Consumption Stats
stat
Min. 6.118e+03
1st Qu. 1.522e+06
Median 2.349e+06
Mean 1.570e+07
3rd Qu. 9.013e+06
Max. 1.446e+09
NA’s 9.000e+00

6.1.2 Histogram of Annual Energy Use

The red vertical line is the median at 2.349e6 kWh per year.

7 Energy Use Intensity, Effective R Value

The energy use intensity (EUI) of a building is its energy use normalized by floor area, which allows comparison of buildings of different sizes within the same climate zone and building type. EUI values differ significantly across industries.

  • A lower EUI value indicates a more efficient building for its class

The effective thermal resistance (R-value) of a building is a measure of the building’s resistance to heat loss. It can be used as a measure of the ‘tightness’ of a building’s insulation.

  • A higher R value indicates a better insulated and sealed building

As long as we have accurate square footage and energy consumption, we should get the correct EUI value. However, the effective thermal resistance requires more complex thermodynamic analysis.
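For reference, the EUI calculation amounts to a single division. The sketch below assumes the result is wanted in kBtu per square foot per year, which this report does not state explicitly; the function name is hypothetical.

```python
KBTU_PER_KWH = 3.412  # standard conversion: 1 kWh is about 3.412 kBtu

def energy_use_intensity(annual_kwh, square_feet, to_kbtu=True):
    """Annual energy use divided by floor area.

    Assumption: output units are kBtu/ft^2/yr when to_kbtu is True,
    otherwise kWh/ft^2/yr.
    """
    if square_feet <= 0:
        raise ValueError("square_feet must be positive")
    energy = annual_kwh * KBTU_PER_KWH if to_kbtu else annual_kwh
    return energy / square_feet
```

For example, a 50,000 square foot building that uses 1,000,000 kWh per year has an EUI of about 68.2 kBtu per square foot. Because the floor area is the denominator, an estimated or wrong square footage scales the EUI by the same factor.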

7.1 Energy Use Intensity

201 buildings have an energy use intensity value.

EUI Stats by Building Type
buts count mean median sd reference
Educational 28 2139.41 45.28 7933.45 73.10000
Entertainment 1 60.95 60.95 NaN 44.78750
Food Sales & Service 19 275.95 238.76 98.73 229.85000
Healthcare 10 330.93 123.85 465.56 97.92857
Industrial 20 1547.27 294.58 5470.15 NaN
Office 37 144.53 51.84 238.30 67.30000
Public services 2 78.91 78.91 4.22 NA
Retail 67 99.48 52.95 163.18 103.18333
Skyscraper 3 48.41 49.77 13.79 NaN
Storage 6 109.48 96.18 69.76 58.20000
Utilities 3 3545.07 2960.05 3557.68 40.69000

A good visualization for the distribution of EUI values is a boxplot. I filtered out EUI values greater than 500 which are highly suspect.

7.2 Effective Thermal Resistance

To give a sense of context, vacuum sealed panels, the top of the line insulation, have an effective R-value of 50 hr F ft^2 / BTU.

In his thesis, Aaron cites a paper (Nordström et al. 2013[^1]) that examined R-values from 6 residential buildings in Sweden built from the 1960s to 2006 to validate the results he obtained from his function. The paper reports R-values between 9.1 and 23.7 hr F ft^2 / BTU.

[^1]: G. Nordström, H. Johnsson, and S. Lidelöw, “Using the Energy Signature Method to Estimate the Effective U-Value of Buildings,” in Sustainability in Energy and Buildings, Springer, Berlin, Heidelberg, 2013, pp. 35–44.

Effective Thermal Resistance Stats by Building Type
buts count mean median sd
Educational 28 387.60 222.95 514.17
Entertainment 1 176.54 176.54 NaN
Food Sales & Service 19 5.62 5.58 2.21
Healthcare 10 68.08 56.92 53.66
Industrial 20 43.70 31.06 61.51
None 8 14.90 3.37 29.45
Office 40 137.66 27.51 262.82
Public services 2 111.10 111.10 54.98
Retail 67 234.10 268.99 130.94
Skyscraper 3 264.32 250.00 252.83
Storage 6 117.12 89.58 80.77
Utilities 3 10.45 0.95 17.05

Boxplots of Effective Thermal Resistance. The red vertical lines indicate the typical range as reported in the paper.

This function needs some work, and I plan on addressing it over winter break. It is based on a thermodynamic model as documented by Aaron in his paper. Professor Abramson has validated the method, but the implementation might need an adjustment. Any ideas would be appreciated.

8 Weather Correlations

The Pearson correlation coefficient measures the strength and direction of a linear relationship between two variables. The following plots show the Pearson correlation coefficient between weather variables and energy consumption.

  • Data subset to operating hours (7 am to 7 pm)
  • Winter: December, January, February
  • Summer: June, July, August
  • Climate zones with only a single building removed
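The filtering and correlation steps listed above can be sketched as follows. The record layout (month, hour, weather value, energy) and the function names are assumptions for illustration, and the sketch does not guard against a variable with zero variance.

```python
import math

WINTER = {12, 1, 2}
SUMMER = {6, 7, 8}

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def seasonal_correlation(records, months):
    """Correlation between a weather variable and energy use for one season.

    records: iterable of (month, hour, weather_value, energy).
    Only operating hours (7 am up to, but not including, 7 pm) are kept.
    """
    pairs = [
        (weather, energy)
        for month, hour, weather, energy in records
        if month in months and 7 <= hour < 19
    ]
    return pearson_r([w for w, _ in pairs], [e for _, e in pairs])
```

Calling seasonal_correlation(records, SUMMER) and seasonal_correlation(records, WINTER) for each weather variable gives one pair of coefficients per building.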

8.1 Boxplots of Correlations

8.2 Heatmaps for Climate Zones and Weather Variables

The following heatmaps show the average correlations between weather conditions and energy consumption by climate zone. The dendrograms cluster similar weather conditions and similar climate zones.

9 Base Peak Ratio

The base to peak ratio is the average base load divided by the average peak load. This marker is segmented by winter and summer and by year so we can look at changes between the seasons as well as changes over the years.

  • Base peak ratio > 0.30 indicates an opportunity for savings by reducing the base load.

We can first look at the base to peak ratio statistics for the final year by sample set. These tables are grouped by season and arranged from lowest (best) to highest (worst) base to peak ratio.

  • pct_savings indicates the percentage of buildings in the sample set that can save by reducing baseload.
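A minimal sketch of the ratio and of pct_savings follows. How the marker defines base and peak load is not spelled out here, so the sketch assumes the base load is the daily minimum and the peak load is the daily maximum. The function names are hypothetical.

```python
def base_to_peak_ratio(daily_profiles):
    """Average base load divided by average peak load.

    daily_profiles: list of days, each a list of load readings.
    Assumption: base load is the daily minimum, peak load the daily maximum.
    """
    base = [min(day) for day in daily_profiles]
    peak = [max(day) for day in daily_profiles]
    return (sum(base) / len(base)) / (sum(peak) / len(peak))

def pct_savings(ratios, threshold=0.30):
    """Fraction of buildings whose ratio exceeds the savings threshold."""
    return sum(r > threshold for r in ratios) / len(ratios)
```

For example, two days with minimum loads of 20 and 30 and maximum loads of 100 each give a ratio of 0.25, which falls below the 0.30 threshold.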

9.1 Statistics by Sample Set

Winter Base Load Stats by Sample Set
styp season count mean median sd pct_savings
Starbucks winter 19 0.2657895 0.250 0.0623891 0.2105263
Schools winter 2 0.2950000 0.295 0.0212132 0.5000000
Prog winter 8 0.4325000 0.390 0.1661110 0.8750000
JCI2_ex winter 169 0.5182840 0.460 0.1991410 0.9112426
FirstE_ex winter 30 0.5310000 0.510 0.2310299 0.8000000
KSU winter 14 0.5585714 0.550 0.1037643 1.0000000
JCI2 winter 357 0.5904482 0.560 0.1992030 0.9551821
CWRU winter 37 0.6183784 0.660 0.1815598 0.9729730
FirstE winter 137 0.6649635 0.710 0.2002209 0.9124088
Summer Base Load Stats by Sample Set
styp season count mean median sd pct_savings
Starbucks summer 19 0.2878947 0.270 0.0686034 0.3157895
Prog summer 7 0.3242857 0.330 0.0877225 0.7142857
Schools summer 2 0.3300000 0.330 0.1555635 0.5000000
JCI2_ex summer 169 0.4682249 0.410 0.2236237 0.7869822
KSU summer 14 0.4835714 0.425 0.1726093 0.9285714
FirstE_ex summer 30 0.4923333 0.425 0.2430791 0.7666667
JCI2 summer 357 0.5435854 0.510 0.2199234 0.8711485
CWRU summer 37 0.5929730 0.620 0.2086713 0.8918919
FirstE summer 137 0.6018248 0.630 0.2133770 0.8978102

We can also look at boxplots for each sampleset. The blue vertical line indicates the threshold established for savings opportunities.

9.2 Summer Versus Winter Ratio

As a sanity check, we can look at a plot showing the relationship between the ratio during the summer and winter. We would expect this to be a positive linear relationship.

9.3 Yearly Changes in Ratio

The base to peak ratio is calculated for each year, so we can look at the changes over the years to see which buildings are improving.

  • Change is defined as oldest ratio - most recent ratio
  • Positive change indicates reduction in ratio
  • Calculated only for buildings with more than one year of base load data
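The change definition above can be sketched as follows. I assume here that the improvement percentage is the fraction of eligible buildings with a positive change; the function names are hypothetical.

```python
def ratio_change(yearly_ratios):
    """Oldest ratio minus most recent ratio.

    yearly_ratios: dict mapping year to base-to-peak ratio.
    Returns None for buildings with fewer than two years of data.
    """
    if len(yearly_ratios) < 2:
        return None
    years = sorted(yearly_ratios)
    return yearly_ratios[years[0]] - yearly_ratios[years[-1]]

def improvement_fraction(changes):
    """Fraction of buildings with a positive change (a reduced ratio)."""
    valid = [c for c in changes if c is not None]
    return sum(c > 0 for c in valid) / len(valid)
```

A building whose ratio falls from 0.60 to 0.50 has a change of +0.10 and counts as improved.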

Most buildings change relatively little, if at all, over the years. Again, we can compare seasons to see whether the summer change in base peak ratio is correlated with the winter change.

Ideally, a building would be in the upper right quadrant, with positive changes in both summer and winter.

Improvement Percent by Sample Set
styp season imp_pct
CWRU summer 0.2500000
CWRU winter 0.3888889
FirstE summer 0.4104478
FirstE winter 0.3955224
FirstE_ex summer 0.5333333
FirstE_ex winter 0.5000000
JCI2 summer 0.3878788
JCI2 winter 0.3878788
JCI2_ex summer 0.4750000
JCI2_ex winter 0.4750000
KSU summer 0.4615385
KSU winter 0.0769231
Prog summer 0.6000000
Prog winter 0.8000000
Schools summer 0.5000000
Schools winter 0.5000000
Starbucks summer 0.7333333
Starbucks winter 0.6000000

10 HVAC Schedule

The HVAC schedule function finds the most likely turn on and turn off times for business and non-business days.

10.1 HVAC Schedule by Sample Set

First, we can look at business day turn on and turn off times by sample set.

One other thing to look at is typical length of operating day.

Average HVAC Schedule by Sample Set
styp mean_on mean_off hours
sampleset10 4.625 16.500 11.875
sampleset2 8.914 19.442 10.527
sampleset3 8.605 19.404 10.799
sampleset4 7.897 18.222 10.325
sampleset5 5.250 18.344 13.094
sampleset6 5.911 16.714 10.804
sampleset7 3.882 21.763 17.882
sampleset8 5.800 18.683 12.883
sampleset9 7.528 14.847 7.319

11 Correlations

We can make some correlation plots to determine relationships that exist between building markers. The quantitative numbers can also be printed to look at the possible trends.

Correlation Matrix
eui r summer_ratio winter_ratio log_annc hours
eui 1.0000000 -0.4132707 0.1996727 0.0775484 0.0839891 0.0186001
r -0.4132707 1.0000000 -0.2086206 -0.2107270 -0.0942968 -0.0588777
summer_ratio 0.1996727 -0.2086206 1.0000000 0.8324121 0.4420122 -0.3586755
winter_ratio 0.0775484 -0.2107270 0.8324121 1.0000000 0.5863760 -0.3626265
log_annc 0.0839891 -0.0942968 0.4420122 0.5863760 1.0000000 -0.2605309
hours 0.0186001 -0.0588777 -0.3586755 -0.3626265 -0.2605309 1.0000000

11.1 Pairwise Correlations

Another good option is to make pairwise plots. The diagonals show the distribution of the variable, and in the second plot, the asterisks indicate the significance of the relationship.

12 Conclusions

  • Distribution across climate zones (kgcz and ashrae) is extremely skewed
    • 75% of buildings in 2 kgcz and 85% in 2 ashrae cz
    • More than 30 buildings in only 4 climate zones under each classification
  • Limited number of building areas and building types
    • Accuracy is also an issue
    • Shreyas is working on finding more but these are all estimates
  • Data cleaning improves the quality of data
    • Most common change: BAAP -> AAAP
  • Heating type seems suspect based on geographical distribution
    • Could potentially be an effect of building type
    • Might need to rethink methodology/cutpoint for heating type
  • Effective Thermal Resistance needs to be modified (I’m on it!)
    • Values are above any reasonable level as established in the literature
  • EUI Values are reasonable (we can’t get this wrong!)
  • Temperature is highly positively correlated with energy consumption during the summer
    • GHI also positively correlated, relative humidity negatively correlated
  • No significant winter weather correlations when averaged across climate zones
  • 91.3% of buildings have base peak savings opportunities during the winter
  • 83.9% of buildings have base peak savings opportunities during the summer
  • Buildings with more than 1 year: 45.5% improved base-peak ratio, 43.7% got worse, (rest had no change)
  • HVAC schedule produces reasonable results
    • Starbucks begins much earlier than other companies
    • Most buildings operate the HVAC between 9.5 and 12.5 hours per day
  • Summer base peak ratio and winter base peak ratio are highly correlated with r = 0.84
    • This is expected and provides confidence in base peak ratio calculation
  • EUI and R value are negatively correlated with r = - 0.38
    • This is expected because better insulated buildings should have a lower energy use per area
  • Summer base peak ratio and winter base peak ratio are mildly correlated with the log of annual consumption

13 Prediction Method

In order to meet ARPA-E milestone 4.1.1, we need to develop a predictive model that achieves an adjusted R^2 greater than 0.85 when predicting six months of consumption. I wanted to test a random forest regression model for this purpose. The details of the random forest are presented below; in short, it is an extremely powerful model that maintains a level of interpretability.
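For reference, the usual definition of the adjusted coefficient of determination is given below, where n is the number of observations and p is the number of predictors. The symbols are mine; the milestone text may define the metric differently.

```latex
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

With many observations and few predictors, as with interval meter data, the adjusted value is only slightly below the plain R^2.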

13.1 Random Forest Description

The original paper describing Random Forests is by Leo Breiman.

To understand the random forest, you first need to grasp the concept of a decision tree. A single decision tree is best described as a flowchart of questions about the variable values of an observation that leads to a classification or prediction. Each question (known as a node) has a yes/no answer based on the value of a particular variable. The two answers form branches leading away from the node. Eventually, the tree terminates in a final classification/prediction node called a leaf. A single decision tree can be arbitrarily large and deep depending on the number of features and the number of classes. Decision trees handle both classification and regression and can learn a non-linear decision boundary (they actually learn many small linear decision boundaries which collectively are non-linear). However, a single decision tree is very prone to overfitting, especially as the depth increases: the tree is flexible enough to simply memorize the training data.

To solve this problem, many decision trees are combined into an ensemble known as a random forest. Each tree in the forest is trained on a randomly chosen sample of the training data (usually drawn with replacement, called bootstrapping), and only a random subset of the features is considered when choosing each split. This increases variability between trees, making the overall forest more robust and less prone to overfitting. To make a prediction, the random forest passes the features (values of variables) of the observation to all trees, then averages the predictions of the trees for regression or takes a majority vote for classification. Training on bootstrap samples and aggregating the results in this way is known as bagging. Some implementations also weight the vote of each tree by the confidence the tree has in its prediction. Overall, the random forest is fast, relatively simple, has a moderate level of interpretability, and performs extremely well on both classification and regression tasks.

The random forest should be one of the first models tried on any machine learning problem and is generally my second approach after a linear model. A number of hyperparameters must be specified ahead of time. The most important are the number of trees in the forest, the number of features considered at each split, the maximum depth of each tree, and the minimum number of observations permitted at each leaf. These can be selected by training many different models with varying hyperparameters and choosing the combination that performs best on cross-validation or a testing set. A random forest performs implicit feature selection and can return the relative importances of the features, so it can also be used to reduce dimensions for additional algorithms.
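To make the bagging idea concrete, here is a toy sketch in plain Python. It is deliberately much simpler than a real random forest and is not the model used for the results in this report: each "tree" is a single split (a stump), and each stump sees one randomly chosen feature rather than a random subset at every split. All names are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Find the single split on one feature that minimizes squared error."""
    best = None
    values = sorted(set(r[feature] for r in rows))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # The feature is constant in this sample: predict the sample mean.
        mean = sum(targets) / len(targets)
        return (feature, 0.0, mean, mean)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, targets, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Average the predictions of all stumps (the aggregation step of bagging)."""
    preds = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return sum(preds) / len(preds)
```

Because every stump is trained on a different resample, no single stump's quirks dominate the averaged prediction, which is the source of the robustness described above.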

A simplified model of a decision tree used for exactly this task is presented below

13.2 Methodology

In order to test the accuracy of the method, I trained the model on all data except for the final six months. I then took the final six months of data and made predictions for the electricity consumption. These predictions were compared to the known true values to assess the predictive capabilities of the random forest. This procedure was then completed for all buildings in HBase.
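The holdout procedure and the two scores can be sketched as follows. The model itself is left out; any predictor could be slotted in between the split and the scoring. The 30-day month approximation and the function names are my assumptions.

```python
from datetime import datetime, timedelta

def split_final_months(series, months=6):
    """Split a time-ordered series into training data and a final holdout.

    series: list of (timestamp, value) sorted by timestamp.
    Assumption: a month is approximated as 30 days.
    """
    cutoff = series[-1][0] - timedelta(days=30 * months)
    train = [point for point in series if point[0] <= cutoff]
    test = [point for point in series if point[0] > cutoff]
    return train, test

def r_squared(actual, predicted):
    """Coefficient of determination of predictions against true values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)
```

Splitting by time rather than at random matters here: a random split would let the model see readings from the same weeks it is asked to predict and would overstate its accuracy.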

13.3 Results

The prediction capabilities of the model have been tested against all buildings in HBase with fewer than 1 million datapoints. The average runtime to train and predict for a building is around 4 minutes (depending on number of datapoints).

We can plot the R-squared and MAPE (mean absolute percentage error) values to assess performance. Around 25% of buildings currently have an R-squared above the required 0.85.

13.3.1 Typical Predictions

The following are predictions made for the Progressive APS building in Phoenix, Arizona. The R-squared value for these predictions was 0.933.

14 Animations

Animated graphs are a good way to highlight changes over time and can compress a considerable amount of information into a single visual. I am not sure if these will be useful, but at the least they are fun to make!

14.1 Weekly Seasonal Patterns

14.2 Daily Patterns

14.3 Year Long